#The Objective

Our data set consists of observations from the Large Hadron Collider at CERN over the year of 2012. It is ordered data of the form \(\{(\mathbf{x}_i,y_i,w_i)\}_{i \in D}\) where \(\mathbf{x} \in \mathbb{R}^{30}\); \(y \in \{\mathbf{b},\mathbf{s}\}\) and \(w \in [0,1]\) is a weight which measures the intensity of each data point.

Each \(\mathbf{x}_i\) is the collection of observables about each event and each \(y_i\) is the categorization of the event as either ‘Background’ \((= \mathbf{b})\) and ‘Signal’ \(=(\mathbf{s})\). We define \[ \mathcal{S} = \{i: y_i = \mathbf{s}\} \text{ and } \mathcal{B} = \{i: y_i =\mathbf{b}\} \] with \(n_s = \mid \mathcal{S} \mid\) and \(n_b = \mid \mathcal{B} \mid\) respectively. The weight variable is not to be used to train the classifier \(f\) in any way, it is only to compute the AMS, which is our measure of the accuracy of our classifier \(f\). Speaking of \(f\), we define \(\hat{f} = \{i: f(\mathbf{x}_i) = \mathbf{s}\}\), i.e the points labelled as signals by \(f\) and using this we define

\[ s = \sum_{i \in \mathcal{S}\cap \hat{f}}w_i \text{ and } b = \sum_{i \in \mathcal{B}\cap \hat{f}}w_i \]

i.e \(s\) and \(b\) are the weighted \(\textit{true positives}\) and \(\textit{false positives}\) respectively. Finally we define the AMS function is defined as follows:

\[ \text{AMS}(f) = \sqrt{2\left(s+b+b_{reg}\ln \left (1+ \frac{s}{b+b_{reg}} \right) -s \right)} \]

Where \(b\) and \(s\) are as above and \(b_{reg}\) is a normalizing constant. Notice that this is computed over a data set \(D\), so if we wish to compute this over some training or test set we must re-normalize the weights. We wrote a function adjust_weights to do this re-normalization, as well as a function approx_median_sig to calculate the resulting AMS value.

Our task is to create some classifier \(f\) which maximizes the AMS.

The Method

As we have \(31\)-dimensional data we will perform preliminary analysis of the variables before running classification. The following is a loose structure of our method for tackling this problem:

  1. Eliminate any obviously redundant variables such as meta-data.
  2. Analyse and deal with any missing data.
  3. Remove variables which have minimal impact on classification.
  4. Perform a classification algorithm.
  5. Study the impact on the AMS.

Preliminary Observations

Types of Data

We have a few types of data to consider:

  • Variables KaggleSet and KaggleWeight can be ignored as they denote which data points were in the provided Kaggle challenge and their relative weights, which is irrelevant to us.

  • The discrete classification variable Label which takes values in \(\{b,s\}\)

  • The continuous variable Weight, which will be used to compute the AMS, and will not be used in classification.

  • The discrete variable PRI_jet_num denotes the amount of jets from each event and takes values in \(\{0,1,2,3\}\).

  • All other variables are either direct or indirect measurements and are continuous.

Co-Dependency of Data

Since we are classifying, we want to remove any redundancies in our data, fortunately we have the following co-dependencies:

  • PRI_jet_all_pt = PRI_jet_leading_pt + PRI_jet_subleading_pt

  • PRI_lep_pt = DER_pt_ratio_lep_tau*PRI_tau_pt

  • Jet-related variables depend on PRI_jet_num (see next section for more)

We can either remove PRI_jet_all_pt or remove both PRI_jet_leading_pt and PRI_jet_subleading_pt to reduce the dimension by \(1\) or \(2\). This same idea can be applied to to the other (multiplicative) dependence. With these reductions we could remove up to \(4\) dimensions from our data.

If we wish to be more liberal with our removal, we could throw out all jet related variables besides PRI_jet_num to remove \(10\) more dimensions from our data. However this is assuming that the other jet variables have little impact on classification which has not been validated - yet.

Missing Data

What is missing and why?

Below are the variables which contain undefined values. By convention it was provided to use with values -999 which is out of range for every observation.

Notice that every column besides DER_mass_MMC is a jet variable, and in those, we only have two values for completion_rate. In fact these correspond to different values of PRI_jet_num:

  • 0 jets: All jet variables were missing data.

  • 1 jet: only PRI_leading_pt,PRI_leading_eta and PRI_leading_phi have data.

  • 2 or more jets: All jet variables have data.

The above result can be found in the handbook for the variables provided with the Kaggle challenge. Unfortunately DER_mass_MMC does not have such an explanation and may be a result of some event during measurement. However it is still assigned the same -999 value as the other missing data.

Impact of the Missing Data

Despite understanding the cause of (most of) our missing data, we still don’t know the impact, the following section is dedicated to understanding the correlation between the missing data from each variable and its classification. Take DER_mass_MMC for example, if we find that the missing data is distributed uniformly across \(y_i \in \{b,s\}\) then the appearance of NA has no impact on classification, however if it is disproportionately skewed towards \(b\) then we can use this to assist our classification.

Below we compute the percentage of NA data in Background and Signal respectively. If this missing data is distributed uniformly across Background and Signal then the percentages should be very close.

Notice that Signal consistently has \(13-18 \%\) less missing data than Background. This means that in both the DER_mass_MMC and jet-variables cases, the amount of missing data is indicative of a classification. Meaning that including -999 (or some other extreme value) can help with classification.

Variable Analysis

The Distributions

To get an understanding of the behavior of variable given their classification, we plotted their densities. The entire list can be found in the appendix but here are a few of note:

Note that the data for these plots the variables have NA instead of -999 for missing data, again see appendix for plots with -999 included. Some key takeaways from these plots are:

  • Many of our parameters are over the range \([-\pi,\pi]\) as they correspond to the angles observed in interactions.

  • We see that some these angle variables: PRI_tau_phi,PRI_lep_phi, PRI_met_phi,PRI_jet_leading_phiand PRI_jet_subleading_phi are uniformly distributed and that there is very little variation between the classifications. Additionally all except PRI_jet_subleading_phi are unaffected when we include the -999 values. This leads us to consider rejecting them from our classification, reducing the dimension by up to 5.

  • The parameters PRI_tau_eta and PRI_lep_eta have exotic distributions which vary with \(b\) and \(s\)

  • We have many single peak distributions such as DER_mass_MMC,DER_mass_transverse_met_lep and DER_pt_ratio_lep_tau which have clear variation between the classes.

  • We can see that the Jet-related variables have clear variation between the classes, reinforcing our need to use them and their corresponding -999 values.

The Mutual Information between \(b\) and \(s\)

While plots are very pretty, we must have a justifiable mathematical result to validate any variable selection. Below we plot the mutual information between \(V \mid y='s'\) and \(V \mid y= 'b'\) respectively.

Below we plot the joint and product distributions for SOME_VARIABLE as a visual aid to how the mutual information is computed.

As you can see for GOOD_EXAMPLE we can see a large variation between the joint and product distributions, whereas with BAD_EXAMP we have almost identical distributions.

Below we display the mutual information of all of our variables.

Fisher Discriminant Analysis

FDA creates an embedding that tells us how linearly separated our data can possibly be by maximizing between class scatterness and minimizing within class scatterness. This can be a useful aid to find out whether a linear classifier is appropriate and can additionally be used to see how removing variables affects the separability of our data. We created two new functions in order to perform FDA: fisher_discrim to find the embedding vector that maximises the ratio of between- to within-class scatterness and scatter_ratio to calculate this ratio for a given dataset and embedding vector. Code for these functions can be found in our R package higgiesmalls.

We consider four increasing cases of variables elimination: 1) Including all variables; 2) Removing all 4 algebraic co-dependencies; 3) Removing the above and any uniform variables; 4) Removing the above and the 10 variables with lowest mutual information. The following plots show the results of the FDA embedding for each of these cases.

For these different FDA embeddings, we can compare the ratio between between class scatterness and within class scatterness:

We find that removing variables slightly lowers the linear separability of the data, but in general the values are comparable. Using only the 10 variables with the highest mutual information, however, yields a lower scatter ratio. Based on the plots, we can conclude that it is not possible to achieve a high degree of linear separability between the signal and background events, indicating that a feature transform is most likely appropriate.

Principle Component Analysis

Support Vector Machine

We divided our dataset into a training and testing split, training SVMs on the training data and finding the AMS on the testing data.

## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(mut_info_vars)
## 
##   # Now:
##   data %>% select(all_of(mut_info_vars))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

By training SVMs with a linear kernel vs. a radial kernel on a 20% training set (using all predictor variables), we found that a radial kernel consistently yielded higher AMS:

Based on these results, along with the results of FDA indicating that the dataset is not linearly separable, we moved forward using a radial kernel. We performed 5-fold cross-validation over the 20% training set to tune the two hyperparameters using a grid search: cost and \(\gamma\) (bandwidth). We allowed cost to take on the values 0.5, 1, and 2 and \(\gamma\) to take on the values 0.01, 0.1, and 0.5. Tuning was performed using our function ams_tune_parallel, which modifies the e1071::tune function to maximise AMS and to conduct cross-validation in parallel, dramatically improving the performance of tuning compared to the original package.

Below are the results of the hyperparameter tuning using the three sets of variables: 1) Including all variables; 2) Removing the algebraic co-dependencies and uniformly distributed variables; 3) Using only the 10 variables with the highest mutual information.

# SVM using all variables:
svm_radial_tune_all <- readRDS(here("output/svm_radial_tune_20_gc.RDS"))
svm_radial_tune_all$best.parameters
##   cost gamma
## 5    1   0.1
# SVM without co-dependencies and uniform variables
svm_radial_tune_drop_codep_unif <- readRDS(here("output/svm_radial_tune_drop_codep_unif_20_gc.RDS"))
svm_radial_tune_drop_codep_unif$best.parameters
##   cost gamma
## 6    2   0.1
# SVM using top 10 mutual information
svm_radial_tune_mut_info <- readRDS(here("output/svm_radial_tune_mut_info_20.RDS"))
svm_radial_tune_mut_info$best.parameters
##   cost gamma
## 9    2   0.5

We then used the model using the best hyperparameter values from tuning to find the AMS value on the 80% testing dataset.

Appendix

Distributions with NA entries

Joint Distributions with NA entries

Distributions with -999 entries